In 2017 Arifu designed an experiment to test whether learners preferred narrative or fact-based training.
A fertilizer training was designed with two variations: ‘THUMB’ for fact-based and ‘NARRATIVE’ for narrative-based.
Your task is to determine which of the two variations was more popular.
NB: Assume the length of the training does not matter.
In addition, feel free to generate any interesting insights from the dataset. Your output will be a write-up and the code used; provide both the findings and an explanation of your method.
library(infer)
library(readr) #to load csv data.
library(dplyr) #data manipulation
library(ggplot2)
library(plotly)
library(DataExplorer)
library(naniar)
library(powerMediation)#power analysis
#library(pwr)
library(broom)
library(DT)
seed <- 2010 #seed for reproducible sampling
fertilizer_data<-read_csv('data/challenge 2 dataset (fertilizer).csv')
Introduce the data
introduce(fertilizer_data)
## # A tibble: 1 x 9
## rows columns discrete_columns continuous_colu~ all_missing_col~
## <int> <int> <int> <int> <int>
## 1 6731 7 6 1 0
## # ... with 4 more variables: total_missing_values <int>,
## # complete_rows <int>, total_observations <int>, memory_usage <dbl>
Plot the data introduction
plot_intro(fertilizer_data)
Look at the columns that have missing values
miss_var_summary(fertilizer_data)
## # A tibble: 7 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 user_response 266 3.95
## 2 message_out 116 1.72
## 3 message_in 20 0.297
## 4 learner_id 0 0
## 5 program_code 0 0
## 6 variation_code 0 0
## 7 created_at 0 0
Plot the missing data
plot_missing(fertilizer_data)
Each column with missing data has less than 5% of its values missing, and since we have relatively many observations we can simply drop the affected rows.
# drop rows with missing values
fertilizer_data<-na.omit(fertilizer_data)
Look at the internal structure
glimpse(fertilizer_data)
## Observations: 6,352
## Variables: 7
## $ learner_id <dbl> 164274, 164274, 164274, 164274, 164274, 164274,...
## $ program_code <chr> "YARA", "YARA", "YARA", "YARA", "YARA", "YARA",...
## $ variation_code <chr> "NARRATIVE", "NARRATIVE", "NARRATIVE", "NARRATI...
## $ message_in <chr> "YARA", "A", "A", "1", "A", "A", "A", "A", "A",...
## $ message_out <chr> "(1/23) A healthy crop makes a wealthy farmer. ...
## $ created_at <chr> "11/2/2017 14:06", "11/2/2017 14:08", "11/2/201...
## $ user_response <chr> "A", "A", "1", "A", "A", "A", "A", "A", "2", "A...
fertilizer_data%>%
ggplot(aes(variation_code))+
geom_bar()
From the plot we can see that ‘THUMB’ (fact-based training) is more popular than ‘NARRATIVE’ (narrative-based training).
But is the difference statistically significant? We will carry out an A/B test to find out.
H0: The difference in proportions of the THUMB and NARRATIVE variations is zero.
HA: The proportion of users who preferred THUMB (fact-based) training is greater than the proportion who preferred NARRATIVE (narrative-based) training.
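Since every training is one of the two variations, the two proportions sum to one, so H0 is equivalent to p(THUMB) = 0.5 and HA to p(THUMB) > 0.5. A minimal illustration of the equivalence, using made-up counts (not taken from the dataset):

```r
# illustrative counts only (not from the dataset)
n_thumb <- 380
n_narrative <- 287
p_thumb <- n_thumb / (n_thumb + n_narrative)
p_narrative <- 1 - p_thumb
# the difference in proportions collapses to 2*p(THUMB) - 1,
# which is zero exactly when p(THUMB) = 0.5
all.equal(p_thumb - p_narrative, 2 * p_thumb - 1)  # TRUE
```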
Variation code
Let us determine the sample size
fertilizer_data%>%
select(variation_code)%>%table()%>%prop.table()
## .
## NARRATIVE THUMB
## 0.4458438 0.5541562
total_sample_size <- SSizeLogisticBin(
  p1 = 0.4458438, # observed proportion of NARRATIVE trainings (control condition)
  p2 = 0.5541562, # observed proportion of THUMB trainings (test condition)
  B = 0.5, # proportion of the sample allocated to the test condition (ideally 0.5)
  alpha = 0.05, # significance level: the probability at which the null hypothesis is rejected, conventionally 0.05
  power = 0.8 # 1 - beta: the probability of rejecting the null hypothesis when HA is true
)
total_sample_size
## [1] 667
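As a rough cross-check of this number, base R’s power.prop.test (a two-sample normal-approximation calculation, so not identical to SSizeLogisticBin) gives a similar total under the same inputs:

```r
# sanity check with base R: required n per group for the same proportions,
# significance level and power (normal approximation, two-sided)
check <- power.prop.test(p1 = 0.4458438, p2 = 0.5541562,
                         sig.level = 0.05, power = 0.8)
2 * check$n  # total across both groups; close to the 667 above
```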
Now let us select a random sample of 667 trainings.
#set seed
set.seed(seed)
#generate 667 random observations
fertilizer_sample_data <- fertilizer_data%>%
select(variation_code)%>%
sample_n(667)
#view
glimpse(fertilizer_sample_data)
## Observations: 667
## Variables: 1
## $ variation_code <chr> "NARRATIVE", "NARRATIVE", "NARRATIVE", "THUMB",...
We can now observe the distributions of our two training categories.
ggplotly(
ggplot(data = fertilizer_sample_data, aes(x = variation_code)) +
geom_bar())
There seems to be a higher preference for THUMB trainings.
Difference in proportions
# note: despite the name, this is a difference in counts (THUMB minus NARRATIVE),
# not a difference in proportions
observed_diff_in_prop<-fertilizer_sample_data%>%
group_by(variation_code)%>%
tally()%>%summarise(diff(n))%>%pull()
diff_prop_data<-fertilizer_sample_data%>%
summarize(diff_in_prop=observed_diff_in_prop)
diff_prop_data
## # A tibble: 1 x 1
## diff_in_prop
## <int>
## 1 93
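The 93 above is a difference in counts; dividing by the sample size converts it to the difference in sample proportions:

```r
# convert the observed count difference (THUMB minus NARRATIVE)
# into a difference in sample proportions
diff_in_counts <- 93
n_total <- 667
diff_in_counts / n_total  # about 0.139: THUMB leads by ~14 percentage points
```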
Finally, we build a null distribution for the proportion of THUMB trainings under H0 and compute a one-sided p-value; an exact binomial test serves as a cross-check.
#observed sample proportion of THUMB trainings
p_hat <- fertilizer_sample_data %>%
  specify(response = variation_code, success = "THUMB") %>%
  calculate(stat = "prop")
p_hat
#null distribution under H0: p(THUMB) = 0.5, i.e. the variations are equally popular
set.seed(seed)
null_dist_prop <- fertilizer_sample_data %>%
  specify(response = variation_code, success = "THUMB") %>%
  hypothesize(null = "point", p = 0.5) %>%
  generate(reps = 1000, type = "draw") %>% #older infer versions use type = "simulate"
  calculate(stat = "prop")
#one-sided p-value: share of simulated proportions at least as large as p_hat
null_dist_prop %>%
  get_p_value(obs_stat = p_hat, direction = "greater")
#equivalent exact test on the THUMB count
binom.test(x = sum(fertilizer_sample_data$variation_code == "THUMB"),
           n = nrow(fertilizer_sample_data), p = 0.5, alternative = "greater")